good data
Good Data Is All Imitation Learning Needs
Samadi, Amir, Koufos, Konstantinos, Debattista, Kurt, Dianati, Mehrdad
In this paper, we address the limitations of traditional teacher-student models, imitation learning, and behaviour cloning in the context of Autonomous/Automated Driving Systems (ADS), where these methods often struggle with incomplete coverage of real-world scenarios. To enhance the robustness of such models, we introduce the use of Counterfactual Explanations (CFEs) as a novel data augmentation technique for end-to-end ADS. CFEs, by generating training samples near decision boundaries through minimal input modifications, lead to a more comprehensive representation of expert driver strategies, particularly in safety-critical scenarios. This approach can therefore help improve the model's ability to handle rare and challenging driving events, such as anticipating darting out pedestrians, ultimately leading to safer and more trustworthy decision-making for ADS. Our experiments in the CARLA simulator demonstrate that CF-Driver outperforms the current state-of-the-art method, achieving a higher driving score and lower infraction rates. Specifically, CF-Driver attains a driving score of 84.2, surpassing the previous best model by 15.02 percentage points. These results highlight the effectiveness of incorporating CFEs in training end-to-end ADS. To foster further research, the CF-Driver code is made publicly available.
Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism
Park, Chanjun, Khang, Minsoo, Kim, Dahyun
This paper delves into the contrasting roles of data within academic and industrial spheres, highlighting the divergence between Data-Centric AI and Model-Agnostic AI approaches. We argue that while Data-Centric AI focuses on the primacy of high-quality data for model performance, Model-Agnostic AI prioritizes algorithmic flexibility, often at the expense of data quality considerations. This distinction reveals that academic standards for data quality frequently do not meet the rigorous demands of industrial applications, leading to potential pitfalls in deploying academic models in real-world settings. Through a comprehensive analysis, we address these disparities, presenting both the challenges they pose and strategies for bridging the gap. Furthermore, we propose a novel paradigm: Model-Based Data-Centric AI, which aims to reconcile these differences by integrating model considerations into data optimization processes. This approach underscores the necessity for evolving data requirements that are sensitive to the nuances of both academic research and industrial deployment. By exploring these discrepancies, we aim to foster a more nuanced understanding of data's role in AI development and encourage a convergence of academic and industrial standards to enhance AI's real-world applicability.
AI Is Running Circles Around Robotics
When people imagine the AI apocalypse, they generally imagine robots. But the robot-takeover scenario most often envisioned by science fiction is not exactly looming. Recent and explosive progress in AI--along with recent and explosive hype surrounding it--has made the existential risks posed by the technology a topic of mainstream conversation. Yet progress in robotics--which is to say, machines capable of interacting with the physical world through motion and perception--has been lagging way behind. "I can't help but feel a little envious," said Eric Jang, the vice president of AI at the humanoid-robotics company 1X, in a talk at a robotics conference last year.
Cleanlab: Correct your data labels automatically and quickly – Towards AI
Originally published on Towards AI. I used an open-sourced library, cleanlab, to remove low-quality labels on an image dataset. The model trained on the dataset without low-quality data gained 4 percentage points of accuracy compared to the baseline model (trained on all data). Improving data quality sounds easy enough. But the workload of manually checking data quality can quickly become insurmountable as the dataset scales.
Good Data from Bad Models : Foundations of Threshold-based Auto-labeling
Vishwakarma, Harit, Lin, Heguang, Sala, Frederic, Vinayak, Ramya Korlakai
Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Auto-labeling systems are a promising way to reduce reliance on manual labeling for dataset construction. Threshold-based auto-labeling, where validation data obtained from humans is used to find a threshold for confidence above which the data is machine-labeled, is emerging as a popular solution used widely in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. In this work, we analyze threshold-based auto-labeling systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two insights. First, reasonable chunks of the unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of threshold-based auto-labeling systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with simulations and study the efficacy of threshold-based auto-labeling on real datasets.
Efficient Medical Image Assessment via Self-supervised Learning
Huang, Chun-Yin, Lei, Qi, Li, Xiaoxiao
High-performance deep learning methods typically rely on large annotated training datasets, which are difficult to obtain in many clinical applications due to the high cost of medical image labeling. Existing data assessment methods commonly require knowing the labels in advance, which are not feasible to achieve our goal of 'knowing which data to label.' To this end, we formulate and propose a novel and efficient data assessment strategy, EXponentiAl Marginal sINgular valuE (EXAMINE) score, to rank the quality of unlabeled medical image data based on their useful latent representations extracted via Self-supervised Learning (SSL) networks. Motivated by theoretical implication of SSL embedding space, we leverage a Masked Autoencoder for feature extraction. Furthermore, we evaluate data quality based on the marginal change of the largest singular value after excluding the data point in the dataset. We conduct extensive experiments on a pathology dataset. Our results indicate the effectiveness and efficiency of our proposed methods for selecting the most valuable data to label.
Pinaki Laskar on LinkedIn: #MLOps #AI #machinelearning
AI Researcher, Cognitive Technologist Inventor - AI Thinking, Think Chain Innovator - AIOT, XAI, Autonomous Cars, IIOT Founder Fisheyebox Spatial Computing Savant, Transformative Leader, Industry X.0 Practitioner Why #MLOps is the key for productionized ML system? ML model code is only a small part ( 5–10%) of a successful ML system, and the objective should be to create value by placing ML models into production. F1 score) while stakeholders focus on business metrics (e.g. Improving labelling consistency is an iterative process, so consider repeating the process until disagreements are resolved as far as possible. For instance, partial automation with a human in the loop can be an ideal design for AI-based interpretation of medical scans, with human judgement coming in for cases where prediction confidence is low.
Smart Water: Data Labeling with Active Learning And H2O.ai
Data is the food for AI. For machine Learning, or supervised learning, the golden labels are key for the models to recognize the pattern within the data. However, in the real-world data, it is usually hard to get large amount of labeled data, for example, search revelance, news topics, autopilot, etc. Recently, Angrew Ng gave a talk on MLOps: From Model-centric to Data-centric AI, where he mentioned the Idea from Big Data to Good Data. Good data is defined consistently and cover of important cases.
AI Ethics Needs Good Data
Daly, Angela, Devitt, S Kate, Mann, Monique
In this chapter we argue that discourses on AI must transcend the language of 'ethics' and engage with power and political economy in order to constitute 'Good Data'. In particular, we must move beyond the depoliticised language of 'ethics' currently deployed (Wagner 2018) in determining whether AI is 'good' given the limitations of ethics as a frame through which AI issues can be viewed. In order to circumvent these limits, we use instead the language and conceptualisation of 'Good Data', as a more expansive term to elucidate the values, rights and interests at stake when it comes to AI's development and deployment, as well as that of other digital technologies. Good Data considerations move beyond recurring themes of data protection/privacy and the FAT (fairness, transparency and accountability) movement to include explicit political economy critiques of power. Instead of yet more ethics principles (that tend to say the same or similar things anyway), we offer four 'pillars' on which Good Data AI can be built: community, rights, usability and politics. Overall we view AI's 'goodness' as an explicly political (economy) question of power and one which is always related to the degree which AI is created and used to increase the wellbeing of society and especially to increase the power of the most marginalized and disenfranchised. We offer recommendations and remedies towards implementing 'better' approaches towards AI. Our strategies enable a different (but complementary) kind of evaluation of AI as part of the broader socio-technical systems in which AI is built and deployed.
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (10 more...)
- Social Sector (1.00)
- Law > Statutes (1.00)
- Information Technology > Security & Privacy (1.00)
- (2 more...)